Skip to content

feat: make parse_url compatible#4413

Open
parthchandra wants to merge 2 commits into
apache:mainfrom
parthchandra:parse_url_2
Open

feat: make parse_url compatible#4413
parthchandra wants to merge 2 commits into
apache:mainfrom
parthchandra:parse_url_2

Conversation

@parthchandra
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #4150.

Rationale for this change

PR #4350 wired parse_url through Comet's serde layer to the upstream datafusion-spark UDFs but marked it Incompatible due to divergences between the WHATWG URL Standard (url crate) and Spark's java.net.URI (RFC 3986). This PR replaces the upstream implementation with a local RFC 3986 regex-based
parser that matches Spark's behavior, promoting parse_url from Incompatible to Compatible.

What changes are included in this PR?

New: native/spark-expr/src/url_funcs/parse_url.rs

  • CometParseUrl and CometTryParseUrl UDFs using RFC 3986 Appendix B regex decomposition instead of the WHATWG url crate
  • Fixes all tracked divergences from Spark:
    1. Empty-string URL with PATH/FILE now returns "" (was NULL)
    2. FILE on URL without explicit path (http://host?foo=bar) returns ?foo=bar (was /?foo=bar)
    3. PATH on trailing slash (http://host/) returns "/" (was "")
    4. 3-arg QUERY with key returns raw percent-encoded value (was decoded)
  • Additional correctness fixes found during adversarial review:
    • Query values containing = (e.g., ?a=b=c) now correctly return b=c
    • 3-arg call with non-QUERY part (e.g., parse_url(url, 'HOST', 'key')) returns NULL
    • Empty port (http://host:/path) strips trailing colon
    • Empty authority (http:///path) returns NULL instead of ""
  • ANSI mode: raises [INVALID_URL] error for malformed URLs (spaces, control chars, missing scheme with ://)

Modified: native/core/src/execution/jni_api.rs

  • Registers CometParseUrl/CometTryParseUrl from spark-expr instead of upstream SparkParseUrl/SparkTryParseUrl

Modified: spark/.../serde/url.scala

  • Removes Incompatible override and incompatibleReason (now Compatible by default)

Modified: SQL test files

  • parse_url.sql: expanded from 4 fallback queries to 30+ native-execution queries covering all components, edge cases, and divergence fixes
  • parse_url_ansi.sql: enables previously-ignored ANSI error tests with expect_error(INVALID_URL)

How are these changes tested?

  • 25 Rust unit tests in parse_url.rs covering all URL components, divergence fixes, error handling, null propagation, and edge cases (query values with =, 3-arg non-QUERY, empty port, empty authority, IPv6, malformed URLs)
  • SQL file tests (parse_url.sql, parse_url_ansi.sql) verified against Spark on both spark-4.0 and spark-4.1 profiles
  • Adversarial code review identified and fixed 4 additional correctness issues before merge

parthchandra and others added 2 commits May 22, 2026 15:06
Spark 3.4 error messages don't include the [INVALID_URL] error class
prefix that Spark 4.x uses. Use the URL value itself as the pattern
since it appears in both versions' error messages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@parthchandra parthchandra marked this pull request as ready for review May 26, 2026 22:49
@parthchandra parthchandra requested a review from andygrove May 26, 2026 22:50
@andygrove
Copy link
Copy Markdown
Member

Thanks. This is a good improvement, but I think there are still a few remaining gaps before we can claim full compatibility.

Ran the PR locally against Spark on parth/parse_url_2 and found four behavioral cases where Comet still diverges from Spark's java.net.URI. Each is small but reproducible. With the serde now defaulting to Compatible, it would be good to capture these in sql-tests/expressions/url/ either as fixes, or as query ignore(<tracking-issue>) placeholders so future readers can find them.

Case Query Spark Comet
Regex metachar in 3-arg query key parse_url('http://h/p?abc=1', 'QUERY', '.bc') 1 null
Non-digit port parse_url('http://host:abc/', 'HOST') null host:abc
Backslash in URL parse_url('http://host/p\q', 'PATH') null /p\q
Unbalanced IPv6 bracket parse_url('http://[::1/path', 'AUTHORITY') null [::1

Quick notes on each:

  1. Spark builds the key matcher as Pattern.compile("(&|^)" + key + "=([^&]*)") with no escaping. . in the key matches any character. extract_query_value in parse_url.rs does literal split('&') matching.
  2. java.net.URI rejects non-digit ports. extract_host in parse_url.rs only strips the trailing :NNNN when the port is all digits, otherwise returns the unstripped string.
  3. has_invalid_uri_chars in parse_url.rs only flags space, { } < >, and bytes < 0x20. java.net.URI rejects a wider set (backslash, double quote, caret, etc.).
  4. extract_host returns None for an unbalanced bracket, but the surrounding regex still captures an authority and path, so AUTHORITY, PATH, and FILE come back non-null while Spark's URI throws and returns null everywhere.

Happy to share the four SQL fixture files if useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support expressions already implemented in datafusion-spark crate

2 participants